REGRX: A COMPUTERIZED STEPWISE REGRESSION ALGORITHM
WITH RESIDUAL ANALYSIS



I.  INTRODUCTION 


Regression  analysis  programs  are  commonly  used  to  develop 
prediction  systems  for  Air  Force  personnel  research,  e.g.,  a  system 
for  accurately  predicting  future  job  performance  of  enlisted 
individuals  based  on  information  obtained  during  their  Air  Force 
careers.  This  information  might  include  such  predictor  variables  as 
aptitude  or  ability  test  scores,  biographical  data,  and  physical 
attributes.  Requirements  for  the  development  of  such  prediction 
systems  within  the  technical  programs  of  the  Air  Force  Human 
Resources Laboratory (AFHRL) are numerous.

Prior  to  implementation  of  the  UNIVAC  1108  computer  system  at 
AFHRL,  the  majority  of  regression  analyses  were  accomplished  by  the 
REGRED  single  correction  iterative  algorithm  (Ward,  Hall,  & 
Buchhorn,  1967).  This  algorithm  had  two  major  disadvantages:  (a) 
it might not converge if two or more variables were highly
intercorrelated,  and  (b)  since  it  did  not  identify  redundant 
variables,  an  incorrect  number  of  degrees  of  freedom  could  be  used 
in  calculating  the  F-ratio  discussed  in  Section  IV.  The  convergence 
problem was eliminated by a modified iterative algorithm called
REGREF (Ward et al., 1967), which corrected on three weights per
iteration  simultaneously;  however,  the  REGREF  algorithm  still  failed 
to  identify  redundant  variables. 

During  the  conversion  from  the  previous  computer  system  to  the 
UNIVAC  1108  computer  system,  a  computerized  regression  algorithm, 
REGRX,  specifically  tailored  to  the  requirements  of  analyses 
performed  by  laboratory  task  scientists  was  developed  to  exploit  the 
capabilities  of  the  UNIVAC  1108.  REGRX  was  implemented  to  improve 
the laboratory's problem-solving capabilities by allowing for
identification of redundant predictor variables, an exact solution
at each step of the algorithm, extensive residual analysis, forcing
certain predictor variables into the final equation, and direct
generation of transformed predictor variables.

Shortly after the REGRX program had been implemented and
thoroughly tested on the UNIVAC 1108, the algorithm was incorporated
as  a  subroutine,  REGRX,  into  the  TRICOR  utility  correlation  and 
regression  software  package,  which  immediately  resulted  in  improved 
analytical  capabilities  and  product  quality.  Since  that  time,  the 
REGRX  subroutine  system  has  undergone  several  modifications,  and  has 
now  been  implemented  on  the  UNIVAC  1100/81. 

The  purpose  of  this  paper  is  to  acquaint  the  potential  user  with 
the  current  capabilities  of  REGRX.  Technical  details  are  discussed 
to  enable  the  user  to  take  complete  advantage  of  the  analytical 
capabilities  of  REGRX.  This  information  includes  a  brief 
introduction  to  the  stepwise  regression  technique,  an  in-depth 
discussion  of  the  REGRX  algorithm,  a  comprehensive  listing  of  the 
computational formulas and definitions of the resulting statistics,
and a description of the algorithm's residual analysis facilities.
Specific  details  for  running  the  TRICOR  software  package  on  the 
UNIVAC  1100/81  are  available  at  AFHRL  in  an  automated  users  manual 
titled  TRICOR:  Utility  Correlation  and  Regression  System. 

II.  STEPWISE  REGRESSION  AND  MODEL  BUILDING 

The  REGRX  regression  procedure  is  a  stepwise  augmentation  and 
elimination algorithm. The stepwise technique (Dixon, 1968; Draper
&  Smith,  1966;  Efroymson,  1960;  Pope  &  Webster,  1972;  Goldberger, 
1961;  Goldberger  &  Jochems,  1961)  is  used  primarily  as  a  research 
tool  to  aid  in  the  screening  and  selection  of  variables  in  the 
development  of  a  mathematical  model  of  a  statistical  relationship 
between a response and a set of independent variables. It is
usually  desirable  that  a  model  of  the  response-independent  variable 
relationship  contain  as  few  independent  variables  as  possible; 
therefore,  for  those  cases  in  which  a  large  number  of  variables  are 
identified  as  having  some  influence  on  the  response,  it  is  necessary 
that  some  form  of  variable  selection  be  performed. 

The  stepwise  algorithm  is  a  systematic  process  for  adding 
variables  to  or  deleting  variables  from  a  given  initial  linear 
model.  First,  the  response  variable  is  regressed  on  the  set  of 
independent  variables  comprising  the  initial  model.  At  each 
subsequent step, a new regression equation is derived from the
equation at the previous step either by deleting a variable for
which the partial F-statistic testing for a zero coefficient falls
below a preassigned value or by adding a variable for which the
partial F exceeds a preassigned value. At some point, this process
of adding and deleting variables is interrupted, and the variables
in the final regression equation are taken as the components of a
new model. Dixon (1968) provides a more complete description of how
the stepwise procedure may be incorporated into a model building
program. The REGRX stepwise algorithm is discussed in more detail
in the following paragraphs. In addition, Appendix A provides a
general summary of the correlation approach to regression, and
Appendix B provides the computational details of the REGRX algorithm.

III.  DESCRIPTION  OF  REGRX  ALGORITHM 

At  each  step  of  the  algorithm,  the  independent  variables  are 
divided into two sets, L and E. L is the set of variables in the
regression  equation  for  the  current  step.  Set  E  contains  all 
independent  variables  that  are  not  contained  in  L.  Thus,  when  a 
variable  is  added  to  L,  it  is  simultaneously  deleted  from  E  and  vice 
versa.

Initially, set L does not contain any variables and set E may
contain a set of "forced" variables that have been designated by the
task scientist to appear in all calculated regression equations. If
no  variables  are  designated  as  "forced,"  the  first  variable  to  be 
added  to  L  is  the  independent  variable  most  highly  correlated  with 
the  dependent  variable.  If  set  E  contains  "forced"  variables,  the 
first  variable  to  be  added  to  L  is  the  "forced"  variable  most  highly 
correlated  with  the  dependent  variable;  the  other  "forced"  variables 
are  considered  for  addition  to  L  by  the  stepwise  procedure  before 
the  remaining  E  variables.  The  set  of  "forced"  variables  that  are 
added to L are denoted by F. A given variable designated as
"forced" is not allowed to be an element of L if the squared
multiple correlation coefficient for the regression of the given
variable on the set L is greater than or equal to 1.0 - TOL, where
TOL is the user-specified tolerance described in Section IV,
Subsection H. The user should
specify a set of independent variables as "forced" if the regression
problem requires that these predictors be present in the final
regression equation.

At each subsequent step, the stepwise procedure regresses the
dependent variable and each variable in E on the variables in L, and
one of the following outcomes occurs:

1.  A  variable  in  L  -  F  (the  set  of  variables  remaining  in  L 
after  the  variables  in  F  have  been  removed  from  consideration)  is 
deleted  from  L  if  the  partial  F-statistic  testing  for  a  zero 
coefficient  is  less  than  a  preassigned  value  and  if  no  other 
variable  in  L  -  F  has  a  smaller  partial  F. 

2. A variable in E is added to L if (a) no variable in L - F
satisfies the removal criterion; (b) the squared multiple
correlation coefficient for the regression of the added variable on
the set L is less than 1.0 - TOL; (c) after adding the variable to
L, the partial F-statistic testing for a zero coefficient exceeds a
preassigned value; and (d) no other variable in E satisfies (b) and
has a larger partial F.

3.  If  neither  of  the  preceding  outcomes  occurs  or  if  the  number 
of  steps  exceeds  a  preassigned  number,  then  the  stepwise  procedure 
terminates. 
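
The three outcomes above can be sketched as a single decision function. This is a minimal illustration, not the REGRX source: the function name `next_action`, its argument layout, and the threshold names `f_remove` and `f_enter` are hypothetical, and the partial F values and squared multiple correlations are assumed to have been computed elsewhere. The step limit and the rule that forced variables are considered before the remaining E variables are left to the caller.

```python
def next_action(partial_f_in_model, candidates, forced, f_remove, f_enter, tol=1e-3):
    """Pick the next stepwise action from precomputed statistics.

    partial_f_in_model: {variable: partial F} for the variables currently in L
    candidates: {variable: (partial F after entry, R^2 of variable on L)} for E
    forced: set of forced variables (set F), which are never deleted
    """
    # Outcome 1: delete the variable in L - F with the smallest partial F,
    # provided that value is below the removal threshold.
    removable = {v: f for v, f in partial_f_in_model.items() if v not in forced}
    if removable:
        weakest = min(removable, key=removable.get)
        if removable[weakest] < f_remove:
            return ("delete", weakest)

    # Outcome 2: among candidates that pass the tolerance check (b),
    # add the one with the largest partial F if it exceeds the entry threshold.
    eligible = {v: f for v, (f, r2) in candidates.items() if r2 < 1.0 - tol}
    if eligible:
        strongest = max(eligible, key=eligible.get)
        if eligible[strongest] > f_enter:
            return ("add", strongest)

    # Outcome 3: nothing qualifies, so the procedure terminates.
    return ("stop", None)
```

A driver loop would call this function once per step, update L and E, recompute the statistics, and stop either on `("stop", None)` or when the preassigned number of steps is exceeded.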


IV.  REGRESSION  OUTPUT  WITH  COMPUTATIONAL 
FORMULAS  AND  COMMENTS 


The  following  information  describes  all  of  the  printed  output  that 
can  be  generated  by  the  REGRX  subroutine  system  except  for  the 
residual  plots  described  in  Section  V.  Subsections  A  to  F  are  output 
produced  for  each  step  printed.  Subsections  G  to  K  are  optional 
output.  Items  appearing  in  upper-case  letters  are  presented  exactly 
as  they  appear  in  the  printed  output.  Details  on  printing  options 
available  are  stated  in  an  automated  users  manual  titled  TRICOR: 
Utility  Correlation  and  Regression  System. 


A. MULTIPLE RSQ: Coefficient of simple determination between the
predicted scores and the observed values for the dependent
variable. If $a$ is the regression constant and $b_j$ is the
estimated regression coefficient for the jth variable, then the
kth predicted score is

$$\hat{y}_k = a + \sum_{j=1}^{p} b_j x_{jk}$$

where $x_{jk}$ is the kth observation of variable j and p is the
number of predictors in the prediction equation.

B. ANALYSIS OF VARIANCE

1. Regression Degrees of Freedom: $DF_{Reg} = p_i$ = number of
predictors in step i

2. Regression Sum of Squares:

$$SS_{Reg} = \sum_{k=1}^{n} \left(\hat{y}_{ik} - m_{\hat{y}_i}\right)^2$$

where n = number of observations
$\hat{y}_{ik}$ = predicted score of kth observation in step i
$m_{\hat{y}_i}$ = mean predicted score in step i

3. Regression Mean Square: $MS_{Reg} = SS_{Reg}/DF_{Reg}$

4. Residual Degrees of Freedom: $DF_{Res} = n - p_i - 1$

5. Residual Sum of Squares:

$$SS_{Res} = \sum_{k=1}^{n} r_{ik}^2$$

where $r_{ik} = y_k - \hat{y}_{ik}$ = residual of kth observation
in step i

6. Residual Mean Square: $MS_{Res} = SS_{Res}/DF_{Res}$

7. F-Ratio = (Regression Mean Square)/(Residual Mean Square)

C. STD ERR EST = $\sqrt{MS_{Res}}$

D.  REG  CONST:  Estimate  of  the  mean  response  when  all  of  the 
predictors  have  a  value  of  zero. 
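
The quantities in Subsections A through C can be computed directly from the observed values and the predicted scores. The sketch below is illustrative only; the function name `anova_summary` and the dictionary keys are hypothetical and are not REGRX output labels. It assumes the predicted scores come from a least squares fit that includes a regression constant.

```python
import math

def anova_summary(observed, predicted, n_predictors):
    """Return MULTIPLE RSQ, the ANOVA entries, and STD ERR EST for one step."""
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n

    # Sums of squares as defined in Subsection B.
    ss_reg = sum((p - mean_pred) ** 2 for p in predicted)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))

    # Squared simple correlation between observed and predicted (Subsection A).
    ss_obs = sum((o - mean_obs) ** 2 for o in observed)
    cross = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(observed, predicted))
    rsq = cross ** 2 / (ss_obs * ss_reg)

    df_reg = n_predictors
    df_res = n - n_predictors - 1
    ms_reg = ss_reg / df_reg
    ms_res = ss_res / df_res
    return {
        "rsq": rsq,
        "df_reg": df_reg, "ss_reg": ss_reg, "ms_reg": ms_reg,
        "df_res": df_res, "ss_res": ss_res, "ms_res": ms_res,
        "f_ratio": ms_reg / ms_res,
        "std_err_est": math.sqrt(ms_res),  # Subsection C
    }

# Made-up example: y regressed on x = 1..5 gives predicted = 2.2 + 0.6 * x.
summary = anova_summary([2, 4, 5, 4, 5], [2.8, 3.4, 4.0, 4.6, 5.2], n_predictors=1)
```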

E. VAR: Variable (ID and name) to enter the prediction system in
step i


F.  Prediction  System  Table 


1.  VARIABLE:  Variable  ID  and  name  for  each  predictor  in  the 
system  in  step  i 


2. REGRESSION WEIGHT ($b_{ij}$): Estimates of the regression
parameters $\beta_{ij}$ for variable j in step i, which indicate the
change in the mean response associated with a unit change in
the corresponding predictor variable when all other predictor
variables are held constant


3. STANDARD WEIGHT ($B_{ij}$): If $SD_y$ is the standard deviation
of the dependent variable and $SD_j$ is the standard deviation
of variable j, then

$$B_{ij} = b_{ij}\,\frac{SD_j}{SD_y}$$

4. SQ CORRELATION VARIABLE VS REST ($R_{ij}^2$): The squared multiple
correlation coefficient for the regression of variable j on
all of the other predictors in the prediction system in step i


5. STANDARD DEVIATION OF REGRESSION WEIGHTS ($SD_{wj}$):

$$SD_{wj} = \sqrt{\frac{MS_{Res}}{(n-1)\,(SD_j)^2\,(1 - R_{ij}^2)}}$$

If variable j is uncorrelated with the other predictors, then

$$SD_{wj} = \sqrt{\frac{MS_{Res}}{(n-1)\,(SD_j)^2}}$$

Note that this is the smallest value that $SD_{wj}$ can assume.

6. INDEPENDENT CONTRIBUTION ($\Delta R_{ij}^2$): Amount by which $R_i^2$ would
decrease if variable j were removed from the prediction system

$$\Delta R_{ij}^2 = \frac{1}{(n-1)(SD_y)^2}\cdot\frac{MS_{Res}\,b_{ij}^2}{(SD_{wj})^2} = \frac{1}{(n-1)(SD_y)^2}\,MS_{Res}\,F_{ij}$$

where $F_{ij}$ is the partial F-statistic for testing the null
hypothesis that the jth partial regression coefficient at
step i equals zero

7. SQUARED PARTIAL CORRELATION ($r^2_{yj \cdot \text{all other predictors}}$):
The marginal contribution of predictor j in the proportionate
reduction in the variance of the dependent variable when all
of the other predictors have already been included in the
prediction system

$$r^2_{yj \cdot \text{all other predictors}} = \frac{\Delta R_{ij}^2}{1 - R_i^2 + \Delta R_{ij}^2}$$

where $R_i^2$ is the squared multiple correlation coefficient for
the regression of the response variable on all of the
predictors for step i
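
Items 6 and 7 of the Prediction System Table can be checked numerically. The sketch below assumes the partial F-statistic is the squared ratio of a regression weight to its standard deviation, and uses the standard identity that the squared partial correlation equals the independent contribution divided by one minus the R-squared of the system without the variable. The function names are hypothetical and the numbers are a made-up single-predictor example.

```python
import math

def independent_contribution(b, sd_w, ms_res, n, sd_y):
    """Return (partial F, independent contribution) for one predictor."""
    partial_f = (b / sd_w) ** 2                      # F = b^2 / SD_w^2
    delta_r2 = ms_res * partial_f / ((n - 1) * sd_y ** 2)
    return partial_f, delta_r2

def squared_partial_correlation(delta_r2, r2_full):
    """Share of the variance left by the other predictors that variable j explains."""
    return delta_r2 / (1.0 - r2_full + delta_r2)

# Single-predictor example: b = 0.6, MS_Res = 0.8, n = 5,
# total sum of squares = 6 (so SD_y^2 = 1.5), SD_w^2 = 0.8 / 10.
partial_f, delta_r2 = independent_contribution(
    b=0.6, sd_w=math.sqrt(0.08), ms_res=0.8, n=5, sd_y=math.sqrt(1.5)
)
partial_r2 = squared_partial_correlation(delta_r2, r2_full=0.6)
```

With only one predictor in the system, the independent contribution and the squared partial correlation both reduce to the overall R-squared, which gives a quick sanity check on the formulas.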

G. REGRESSION SUMMARY TABLE

For  each  step,  the  summary  table  gives  the  step  number,  the 
ID  of  the  variable  entered  or  removed,  the  coefficient  of  multiple 
correlation  and  coefficient  of  multiple  determination  for  the 
regression  of  the  dependent  variable  on  all  of  the  predictors  in 
the  prediction  system,  the  change  in  the  coefficient  of  multiple 
determination  from  the  previous  step,  the  residual  mean  square, 
the  square  root  of  the  residual  mean  square,  the  F-ratio,  the 
partial  F  value,  and  the  number  of  predictors  in  the  prediction 
system. 


The formula used for computation of the F-ratio for step i is

$$F_i = \left(\frac{R_i^2}{1 - R_i^2}\right)\left(\frac{n - p_i - 1}{p_i}\right)$$

The partial F-ratio is directly related to the independent
contribution for the variable entered at each step. An alternate
computational formula for this statistic is

$$F = \frac{\left|R_i^2 - R_{i-1}^2\right| / df_1}{\left(1 - R_{max}^2\right) / df_2}$$

where $R_i^2$ = coefficient of multiple determination at step i
$R_{i-1}^2$ = coefficient of multiple determination at step i-1
$R_{max}^2$ = the larger of $R_i^2$ and $R_{i-1}^2$
$df_1$ = difference in the numbers of predictors in the
prediction systems corresponding to steps i and i-1;
consequently, $df_1$ will always equal 1
$df_2$ = n-p-1, with p being the number of predictors
in the prediction system for the step corresponding to
the larger of $R_i^2$ and $R_{i-1}^2$

This statistic is identical to the traditionally used comparison
of  "full  model"  versus  "restricted  model"  in  the  iterative  REGREF 
regression  algorithm  (Bottenberg  &  Ward,  1963). 
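
Both summary-table statistics depend only on the coefficients of multiple determination and the degrees of freedom, so they are easy to reproduce by hand. The following sketch uses hypothetical function names and treats the difference in R-squared as an absolute value so that deletion steps are handled the same way as entry steps; that reading of the formula is an assumption.

```python
def overall_f(r2, n, p):
    """F-ratio for a step with p predictors and n observations."""
    return (r2 / (1.0 - r2)) * ((n - p - 1) / p)

def partial_f(r2_step, r2_previous, n, p_larger):
    """Partial F for the variable entered or removed between two adjacent steps.

    p_larger is the number of predictors in whichever of the two steps has
    the larger coefficient of multiple determination.
    """
    r2_max = max(r2_step, r2_previous)
    df1 = 1                      # adjacent steps differ by exactly one predictor
    df2 = n - p_larger - 1
    return (abs(r2_step - r2_previous) / df1) / ((1.0 - r2_max) / df2)

# First step of a made-up problem: R^2 goes from 0 to 0.6 with n = 5.
first_step_f = overall_f(0.6, n=5, p=1)
first_step_partial = partial_f(0.6, 0.0, n=5, p_larger=1)
```

At the first step the two statistics coincide, because the full model is compared with a model containing only the regression constant.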


H.  LINEAR  DEPENDENCIES 

If  the  correlation  matrix  is  not  of  full  rank,  i.e.,  some  of 
the  predictor  variables  are  redundant,  then  the  least  squares 
normal  equations  will  not  have  a  unique  solution.  Least  squares 
parameter  estimates  can  still  be  obtained  (Rao  &  Mitra,  1971; 
Searle,  1971);  however,  there  will  be  infinitely  many  estimates, 
all  equally  good.  An  alternative  is  to  identify  redundancies  and 
assign  zero  weights  to  the  redundant  variables,  thereby 
eliminating  them  from  the  prediction  system. 

At each step, the REGRX algorithm computes a regression for
each candidate entry variable on all of the variables in the
prediction system. If the coefficient of multiple determination
for any of these regressions is greater than 1 - TOL, where the
value of TOL is specified by the user and ranges from $10^{-1}$ to
a smaller power of ten [exponent illegible], the variable is
considered redundant and will not be
allowed to enter the prediction system. When a variable is
identified as redundant in this way and at least one of the
standardized partial regression coefficients is greater than or
equal to $10^{-5}$, the ID for the redundant variable is printed with
the corresponding regression coefficients. The "intercept"
printed is the regression constant for the prediction equation. A
variable is also considered redundant if its entry into the
prediction system would cause a linear dependency among those
variables in the augmented system.

The value of TOL should be selected with care. Choice of an
ideal value for TOL depends greatly on the data set to be
analyzed. There may be a tendency to choose small values for TOL
to allow as many variables as possible to enter the prediction
system. However, extensive testing has shown that known linear
dependencies  may  not  be  identified  by  REGRX  if  TOL  is  set  very 
low.  Moreover,  when  the  value  of  TOL  is  set  at  a  small  value,  the 
following  three  undesirable  situations  are  more  likely  to  occur: 
(a)  severe  computational  accuracy  problems,  (b)  large  standard 
errors  for  the  regression  coefficients,  and  (c)  results  being 
adversely  affected  by  slight  data  recording  errors.  On  the  other 
hand,  choosing  a  value  for  TOL  that  is  too  large  may  exclude 
predictor  variables  that  are  functions  of  other  predictor 
variables  even  when  those  variables  would  contribute  greatly  to 
predictive efficiency. Unless specified otherwise, the value of
TOL is set to $10^{-3}$, which has been shown to be suitable for most
applications.
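
The tolerance check can be illustrated for the simplest case, in which the prediction system holds a single variable, so that the coefficient of multiple determination reduces to a squared simple correlation. The sketch below is an assumption-laden illustration: the function names are hypothetical, the data are made up, and the comparison uses "greater than or equal to", following the wording used for forced variables in Section III.

```python
def squared_correlation(x, z):
    """Squared simple correlation between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_z = sum(z) / n
    sxz = sum((a - mean_x) * (b - mean_z) for a, b in zip(x, z))
    sxx = sum((a - mean_x) ** 2 for a in x)
    szz = sum((b - mean_z) ** 2 for b in z)
    return sxz ** 2 / (sxx * szz)

def is_redundant(r2_on_system, tol=1e-3):
    """Flag a candidate whose R^2 on the current system reaches 1 - TOL."""
    return r2_on_system >= 1.0 - tol

in_system = [1, 2, 3, 4, 5]
near_copy = [2.0, 4.0, 6.0, 8.0, 10.001]   # almost exactly 2 * in_system
unrelated = [2, 1, 4, 3, 6]

print(is_redundant(squared_correlation(in_system, near_copy)))
print(is_redundant(squared_correlation(in_system, unrelated)))
```

Lowering `tol` toward zero makes the check more permissive, which is the situation the text warns about: nearly dependent variables are then admitted and the computation becomes ill-conditioned.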

I. RANGE TABLE

The  Range  Table  gives  the  means,  standard  deviations,  and 
maximum  and  minimum  values  for  the  observed  values  for  the 
dependent  variable,  predicted  scores,  and  residuals.  The  table 
also  gives  the  residual  variance  and  the  coefficients  of 
correlation  and  determination  between  the  observed  values  for  the 
dependent  variable  and  the  predicted  scores,  and  between  the 
residuals  and  the  predicted  scores. 

The  maximum  and  minimum  range  values  for  the  observed  values 
for  the  dependent  variable  and  the  predicted  scores  should  be 
comparable.  Vast  differences  between  maximums  or  minimums,  as 
compared  to  the  residual  standard  deviation,  may  indicate  either 
an  error  in  the  data  or  the  inability  to  predict  extreme  values  of 
the  criterion,  suggesting  additional  terms  need  to  be  included  in 
the  model. 

J. TABLE OF LARGEST RESIDUALS

The  table  of  largest  residuals  gives  the  case  identification 
number,  predicted  score  for  the  dependent  variable,  and  residual 
and predictor values (includes variable name and identification
number for all variables in the prediction system) for the cases
associated  with  the  X  largest  residuals.  X  is  the  lesser  of  the 
following  two  quantities:  10,  or  the  number  of  cases  divided  by 
20.  The  detection  of  outliers  or  data  errors  is  facilitated  by 
printing  extreme  residual  values.  The  range  values  for  the 
residuals  should  be  within  plus  or  minus  three  standard  deviations 
of  the  residual  mean.  This  is  in  accord  with  the  fact  that  in  a 
normal  population  virtually  all  points  lie  within  plus  or  minus 
three  standard  deviations  of  the  mean. 
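
The selection rule for the table of largest residuals can be sketched as follows. The function name is hypothetical, and two details are assumptions not stated in the text: "largest" is taken to mean largest in absolute value, and the division by 20 is taken to truncate.

```python
def largest_residuals(case_ids, residuals):
    """Return (case id, residual) pairs for the X largest residuals.

    X is the lesser of 10 and the number of cases divided by 20.
    """
    n = len(residuals)
    count = min(10, n // 20)
    ranked = sorted(zip(case_ids, residuals), key=lambda pair: abs(pair[1]), reverse=True)
    return ranked[:count]

# Made-up data: 40 cases, so X = min(10, 2) = 2.
ids = list(range(40))
res = [((-1) ** k) * k for k in range(40)]
print(largest_residuals(ids, res))
```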

K.  TABLE  OF  RESIDUALS 

This  table  is  printed  upon  request  by  the  user  and  lists  for 
each  observation,  the  case  identification  number,  predicted  score 
for the dependent variable, and residual value.

V.  RESIDUAL  PLOTS 

Descriptions  of  the  REGRX  residual  analysis  facilities  are 
presented  below.  To  complement  these  descriptions,  the  reader  is 
referred  to  Draper  and  Smith  (1966)  where  an  excellent  discussion  of 
the  analysis  of  residuals  is  presented. 

Plot  of  Residuals  vs  Predicted  Scores 

A  plot  of  residuals  versus  predicted  scores  for  a  typical  REGRX 
problem  is  shown  in  Figure  1.  The  residual  axis  appears  vertically  on 
the page. Two scales are given: (a) the standardized residual, and
(b)  the  residual  itself.  The  two  rows  immediately  above  and  below  the 
plot represent the predicted score axis. The first and last entries
on this axis are the smallest and largest predicted scores. If $r_j$
and $s_r$ are the jth residual and the residual standard deviation,
respectively, then the standardized residual is $r_j/s_r$. When the
residuals follow a normal distribution with variance $s_r^2$, the
standardized residuals follow a normal distribution with variance
unity. Thus, approximately 95% of the residuals would be expected to
fall between -2 and +2 on this axis.
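
A short sketch of the standardized scale follows. It assumes the residual standard deviation is the sample standard deviation with an n - 1 divisor; the text does not say which divisor REGRX uses, and the function names are hypothetical.

```python
import statistics

def standardize(residuals):
    """Divide each residual by the residual standard deviation."""
    s_r = statistics.stdev(residuals)   # n - 1 divisor, an assumption
    return [r / s_r for r in residuals]

def fraction_within(standardized, limit=2.0):
    """Fraction of standardized residuals between -limit and +limit."""
    inside = sum(1 for z in standardized if -limit <= z <= limit)
    return inside / len(standardized)

z = standardize([-0.8, 0.6, 1.0, -0.6, -0.2])
print(fraction_within(z))
```

For a large sample of normally distributed residuals the printed fraction should be close to 0.95; a much smaller value suggests heavy tails or outliers.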

A cell is defined as the intersection of a row and a column of the
graph. Each cell may have several points plotted within it. The
maximum  cell  frequency,  i.e.,  the  largest  number  of  points  plotted 
within  a  cell  of  the  graph,  is  printed  at  the  top  of  the  plot.  In 
Figure  1,  MAXIMUM  CELL  FREQ  =  3.  The  number  of  points  plotted  within 
a  cell  is  indicated  by  a  numeric  character  or  asterisk.  The  numeric 
character  L  indicates  that  between  10L  and  10(L+1)  percent  of  the 
maximum  cell  frequency  points  were  plotted  in  the  respective  cell.  An 
asterisk  indicates  that  the  cell  contains  exactly  MAXIMUM  CELL  FREQ 
points.  For  example,  a  numeric  character  4  with  a  MAXIMUM  CELL 
FREQ=20  would  indicate  that  between  40  and  50  percent  of  20  points, 
i.e.,  between  8  and  10  points,  are  plotted  in  that  particular  cell. 
Truncation  is  assumed.  If  MAXIMUM  CELL  FREQ  =  5,  then  between  40  and 
50 percent of 5 points, i.e., 2 points, are plotted in a cell where
L = 4. Similarly, a numeric character of 2 would indicate that exactly
one  point  is  plotted  in  the  cell.  The  printing  of  cell  frequency 
count  indicators  takes  precedence  over  all  other  printing  requirements. 
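
Read in the other direction, the coding rule maps a cell count to the character printed in that cell. The sketch below is one reading of the rule, with a hypothetical function name: the digit is the truncated value of ten times the count divided by the maximum cell frequency, a full cell prints an asterisk, and an empty cell is assumed to print a blank.

```python
def cell_symbol(count, max_cell_freq):
    """Character printed for a cell holding `count` points."""
    if count <= 0:
        return " "                      # empty cell (assumed to print a blank)
    if count >= max_cell_freq:
        return "*"                      # cell holds exactly MAXIMUM CELL FREQ points
    # Digit L such that the count lies between 10L and 10(L+1) percent
    # of the maximum cell frequency, using integer truncation.
    return str((10 * count) // max_cell_freq)

# The cases worked in the text, with MAXIMUM CELL FREQ = 5:
print(cell_symbol(2, 5))
print(cell_symbol(1, 5))
```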

A  row  of  equal  signs  and  a  column  of  periods  identify  the  residual 
mean  and  the  predicted  score  mean,  respectively.  The  REGRX  algorithm 
performs  a  quadratic  regression  of  the  residuals  on  the  predicted 
scores,  i.e.,  the  independent  variable  appears  in  the  first  and  second 
degree in the model. The dashes in Figure 1 depict quadratic
regressions of the residuals on the predicted scores for the set of
points located above the initial quadratic regression and for the set
of points located below the initial quadratic regression.

This plot is useful in detecting heteroscedasticity (unequal
variances for error terms) and model inadequacies. If neither of
these abnormalities is present, the plotted points should appear as a
random scatter of points about a line parallel to the predicted score
axis and intercepting the residual axis at zero. However, a
systematic pattern of points such as a wedge (heteroscedasticity) or
curvilinear (model inadequacy) shape signals the need for corrective
action. If heteroscedasticity is present, the analyst should consider
the use of weighted least squares or various transformations of the
dependent variable such as $\sqrt{Y}$, $1/Y$, and $\log Y$. Similarly, methods of
dealing with model inadequacy are transforming the dependent variable
or including additional terms in the model such as square or
cross-product terms.

This plot is also useful in detecting outliers (points that are
more than three standard deviations from the residual mean). Since
the least squares fit is "pulled" disproportionately toward these
observations, outliers should be carefully examined to determine if
they convey important information about the analysis or if they
resulted from a procedural error such as a miscalculation, inaccurate
recording, or equipment malfunction. In general, an outlier should not
be eliminated from the analysis unless the task scientist can identify
an error source causing the extreme value.

The plotted points in Figure 1 exhibit no severe abnormalities
such as those exhibited in Draper and Smith (1966); however, the
dashes do show a slight curvilinear trend in the data.

Residual  Frequency  Plot 

This  plot  (shown  in  Figure  1)  is  printed  on  the  same  page  as  the 
previously  discussed  plot.  The  residual  axis  appears  vertically  on 
the  page.  The  number  of  points  falling  within  each  interval  of  the 
residual axis is printed vertically in the left margin. If N is the
total number of residuals, i.e., the total number of cases, and if L
is the frequency count for an interval on the residual axis, then
100L/N percent of the residuals fall in this particular interval.
Each equal sign represents one-fourth percent of the total number of
points; therefore, 400L/N equal signs would be printed on the line
corresponding to that interval.
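
The bar length for one interval of the frequency plot follows directly from the rule above. In this sketch the function name is hypothetical and the result is truncated to a whole number of characters, which is an assumption; the text does not say how fractional equal signs are handled.

```python
def frequency_bar(interval_count, total_cases):
    """Row of equal signs for one interval: 400 * L / N characters."""
    n_signs = (400 * interval_count) // total_cases
    return "=" * n_signs

# An interval holding 25 of 357 residuals covers about 7 percent of the cases.
bar = frequency_bar(25, 357)
print(len(bar), bar)
```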

A normal frequency curve is superimposed on the frequency plot as
a series of plus signs and "]" symbols, the plus sign representing an
overlap between an equal sign and a "]" symbol. The purpose of this
superimposed  curve  is  to  provide  a  visual  standard  against  which  the 
observed  frequency  curve  can  be  compared.  The  chi-square  value  printed 
at  the  top  of  the  plot  provides  a  quantitative  test  of  the  hypothesis 
that  the  residuals  are  normally  distributed.  NUMBER  OF  CELLS  is  the 
number  of  intervals  (ranges  from  5  to  49  depending  on  sample  size) 
used  for  this  test.  PROBLEM  NORMAL  is  the  probability  of  exceeding  the 
chi-square  value  when  the  residuals  are  normally  distributed. 
Measures  of  the  asymmetry  (skewness)  and  flatness  (kurtosis)  are 
printed  above  the  chi-square  value. 

The  correlation  coefficients  between  the  residuals  and  the 
predicted  scores  and  between  the  residuals  and  the  squared  predicted 
scores  are  printed  below  the  plot.  The  first  of  these  correlation 
coefficients  is  expected  to  be  zero  and  the  second  is  an  indicator  of 
the  quadratic  tendency  between  the  residuals  and  the  predicted  scores. 

Cumulative  Frequency  Plot 

The residual axis appears vertically on the page in Figure 2 in
the same manner as for the plot of residuals versus predicted scores.
The horizontal axis at the top of the graph is the cumulative
frequency axis with the cumulative frequencies given as fractions of
the total sample size. Thus, if N = 357 is the total number of cases, a
fraction of X = .510 would indicate a cumulative frequency of
(X)(N) = (.510)(357) = 182. The frequency curve is plotted using the
symbol "F" or an asterisk. A curve drawn through the Fs and asterisks
should resemble a normal cumulative frequency curve. For visual
comparison, a normal cumulative frequency curve could have been
superimposed on the graph. However, it is easier to observe a
deviation from a straight line than it is from the normal curve.
Therefore, as an alternative, the observed frequency plot was
transformed in such a manner that it gives a straight line if the
original plot was in fact a normal cumulative frequency curve but
would deviate from a straight line if the original differed from a
normal curve.

The  horizontal  axis  at  the  bottom  of  the  graph  is  the  cumulative 
frequency  axis  for  the  transformed  curve.  The  transformed  cumulative 
frequency  curve  is  superimposed  on  the  cumulative  frequency  plot  using 
the  symbol  "N"  or  an  asterisk.  The  asterisk  represents  a  point  at 
which  the  original  curve  and  its  transform  intersect.  The  horizontal 
row  of  equal  signs  corresponds  to  the  residual  mean. 
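
A transformation with the property described above is the inverse normal (probit) transform of the cumulative fraction, as used in a normal probability plot. Whether REGRX uses exactly this transform is not stated, so the sketch below is an assumption, as are the function name and the plotting position (k - 0.5)/N.

```python
from statistics import NormalDist

def cumulative_points(residuals):
    """Return (residual, cumulative fraction, normal score) for each case.

    Plotting the residual against the normal score gives an approximately
    straight line when the residuals are normally distributed.
    """
    n = len(residuals)
    standard_normal = NormalDist()
    points = []
    for k, r in enumerate(sorted(residuals), start=1):
        fraction = (k - 0.5) / n            # assumed plotting position
        points.append((r, fraction, standard_normal.inv_cdf(fraction)))
    return points

for residual, fraction, score in cumulative_points([-0.8, 0.6, 1.0, -0.6, -0.2]):
    print(f"{residual:6.2f} {fraction:5.2f} {score:7.3f}")
```

The cumulative fraction corresponds to the "F" curve read against the top axis, and the normal score corresponds to the "N" curve read against the bottom axis.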

Plot  of  Residuals  versus  Predictor  Values 

A plot of the residuals versus a predictor variable is shown in
Figure  3.  The  values  on  the  horizontal  axis  are  the  values  of  the 
predictor.  A  horizontal  row  of  equal  signs  and  a  vertical  row  of 
periods  identify  the  residual  mean  and  predictor  variable  mean, 
respectively.  The  correlation  coefficients  between  the  residuals  and 
the  predictor  variable  scores  and  between  the  residuals  and  the 
squared  predictor  variable  scores  are  printed  below  the  plot.  The 
residual  axis  appears  vertically  on  the  page  in  the  same  manner  as  for 
the plot of residuals versus predicted scores. In addition, the
maximum cell frequency value, numeric characters, and asterisk and
dash symbols are printed as before.

This  plot  is  useful  in  detecting  heteroscedasticity  and  model 
inadequacies.  As  before,  the  absence  of  abnormalities  is  indicated  by 
a  random  scatter  of  points  about  a  line  parallel  to  the  predictor 
score  axis  intercepting  the  residual  axis  at  zero.  A  wedge-shaped 
point scatter is indicative of heteroscedasticity. Possible corrective
actions that should be investigated by the task scientist include
the  use  of  weighted  least  squares  and  various  transformations  of  the 
dependent variable such as $Y/X$ or $Y/\sqrt{X}$. A curvilinear trend in
the  point  scatter  is  indicative  of  an  inadequate  model.  Possible 
corrective  actions  for  this  abnormality  include  the  use  of  various 
transformations  on  the  dependent  variable  and  adding  terms  to  the 
model  such  as  square  or  interaction  terms.  Outliers  are  also  easily 
identified in these plots. The plotted points in Figure 3 exhibit no
severe  abnormalities. 

REFERENCE NOTE

Specific details for running the TRICOR software package on the AFHRL
UNIVAC 1108 are available in an unpublished automated users manual at
AFHRL titled TRICOR: Utility Correlation and Regression System.

APPENDIX A: CORRELATION APPROACH TO REGRESSION¹


Regression methodology is concerned with the problem of estimating
the parameters $\beta_1, \beta_2, \ldots, \beta_p$ and $\alpha$ in the linear model

$$\mathbf{Y} = \alpha\mathbf{1} + \mathbf{x}_1\beta_1 + \cdots + \mathbf{x}_p\beta_p + \mathbf{E} = \alpha\mathbf{1} + \mathbf{X}\boldsymbol{\beta} + \mathbf{E}.$$

For this problem, the following assumptions are commonly made:
$\mathbf{X}$ is a matrix of known form and the error component, $\mathbf{E}$,
is assumed to be distributed with mean vector $\mathbf{0}$ and
variance-covariance matrix $\sigma^2\mathbf{I}$. According to the
Gauss-Markov Theorem, the minimum variance unbiased linear estimators
$\hat{\alpha}, \hat{\beta}_1, \ldots, \hat{\beta}_p$ for the parameters
$\alpha, \beta_1, \ldots, \beta_p$ are obtained by
the method of least squares. This method leads to a system of linear
equations, called the normal equations, which are solved for
$\hat{\alpha}$ and $\hat{\boldsymbol{\beta}}$.

$$\mathbf{1}^T\mathbf{Y} = \mathbf{1}^T\mathbf{1}\,\hat{\alpha} + \mathbf{1}^T\mathbf{X}\hat{\boldsymbol{\beta}}$$
$$\mathbf{X}^T\mathbf{Y} = \mathbf{X}^T\mathbf{1}\,\hat{\alpha} + \mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}}$$

The superscript T denotes that the columns of $\mathbf{X}$ are the rows of $\mathbf{X}^T$
and the rows of $\mathbf{X}^T$ are the columns of $\mathbf{X}$. To decrease the effects of
rounding error in the computation of the solution of the normal
equations, the observations $X_{ij}$ and $Y_j$ are first centered and then
rescaled to standardized form $z_{ij}$ and $y_j$, where

$$z_{ij} = (X_{ij} - \bar{X}_i)/s_i, \qquad y_j = (Y_j - \bar{Y})/s_y$$

$X_{ij}$ = jth observation of variable i

$\bar{X}_i$ = sample mean for variable i

$s_i$ = sample standard deviation for variable i


¹Matrices, vectors, and scalars will be denoted by uppercase
boldface letters, lowercase boldface letters, and upper or lowercase
regular typeface, respectively. Numerically subscripted scalars
identify elements of matrices (row identification, column
identification) or vectors (row identification) and numerically
subscripted matrices identify partitioned elements of matrices.

The normal equations can be rewritten in terms of the standardized
variables as


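$$g = R_{11} b$$

where, with n denoting the number of observations,

$g = \frac{1}{n-1} Z^T y$

$Z = (z_{ij})$

$R_{11} = \frac{1}{n-1} Z^T Z$

$b = \frac{1}{s_y} S \hat\beta$

$S = \mathrm{diag}(s_i)$

$S = \mathrm{diag}(s_i)$ means that $S$ is a diagonal matrix with the ith
diagonal entry equal to $s_i$. The estimates $\hat\beta$ are calculated by solving
the system $R_{11} b = g$ for $b$ and then computing $\hat\beta_i = \frac{s_y}{s_i} b_i$.
Finally $\hat\alpha$ is obtained from $\hat\alpha = \bar Y - \bar X_1\hat\beta_1 - \cdots - \bar X_p\hat\beta_p$.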
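
The correlation approach can be illustrated with a short sketch. The code below is not the REGRX implementation; it is a minimal pure-Python illustration that assumes sample standard deviations with an n-1 divisor, and the names `solve`, `fit_by_correlation`, `X_demo`, and `Y_demo` are hypothetical. It standardizes the data, solves the system for the standardized weights b, and rescales them to obtain the raw-score weights and the intercept.

```python
import math

def solve(A, y):
    """Solve A x = y by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        M[c] = [x / M[c][c] for x in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [x - f * z for x, z in zip(M[r], M[c])]
    return [M[i][n] for i in range(n)]

def fit_by_correlation(X, Y):
    """X: n observations, each a list of p predictor values; Y: n responses.
    Returns (alpha_hat, beta_hat) computed from the correlation matrix."""
    n, p = len(X), len(X[0])
    cols = [[row[i] for row in X] for i in range(p)] + [list(Y)]
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / (n - 1))
           for c, m in zip(cols, means)]
    # Standardized scores; the last list holds the standardized response.
    z = [[(x - m) / s for x in c] for c, m, s in zip(cols, means, sds)]
    corr = lambda u, w: sum(a * b for a, b in zip(u, w)) / (n - 1)
    R11 = [[corr(z[i], z[k]) for k in range(p)] for i in range(p)]
    g = [corr(z[i], z[p]) for i in range(p)]
    b = solve(R11, g)                                   # standardized weights
    beta = [sds[p] / sds[i] * b[i] for i in range(p)]   # raw-score weights
    alpha = means[p] - sum(means[i] * beta[i] for i in range(p))
    return alpha, beta

# Noise-free illustrative data generated from Y = 2 + 3*x1 - x2.
X_demo = [[1, 2], [2, 1], [3, 5], [4, 3], [5, 8]]
Y_demo = [2 + 3 * x1 - x2 for x1, x2 in X_demo]
alpha_hat, beta_hat = fit_by_correlation(X_demo, Y_demo)
```

Because the illustrative data contain no error term, the recovered intercept and weights should agree with the generating values up to rounding error.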
APPENDIX B: COMPUTATIONAL DETAILS OF REGRX ALGORITHM²


The Gaussian elimination algorithm is used to compute the statistics
required to implement the stepwise program (Draper & Smith, 1966;
Efroymson, 1960). The algorithm depends on the following observation.
If $P$ and $Q$ are nonsingular matrices, then the two systems $R_{11}b = g$
and $PR_{11}Qh = Pg$ are equivalent in the sense that $h$ is a solution
of the second system if and only if $Qh$ is a solution of the first.
Hence, if the second system can be solved for some $P$ and $Q$, then the
solution of the first system is easily derived. In particular, if $P$
and $Q$ are such that $PR_{11}Q$ is triangular or diagonal, then the
second system can be solved immediately. In practice, $g$ appears as a
subvector of a row and column in a larger matrix $R$ which also includes
$R_{11}$ as a submatrix.

Any $r \times c$ matrix $A$ may be written in partitioned form as

$$A = \begin{bmatrix} A_{11} & A_{12} & \cdots & A_{1b} \\ A_{21} & A_{22} & \cdots & A_{2b} \\ \vdots & \vdots & & \vdots \\ A_{s1} & A_{s2} & \cdots & A_{sb} \end{bmatrix}$$

where $A_{ik}$ is $r_i \times c_k$, $\sum_{i=1}^{s} r_i = r$, and $\sum_{k=1}^{b} c_k = c$.

A typical partitioning of $R$ is the following:

$$R = \begin{bmatrix} R_{11} & g \\ g^T & 1 \end{bmatrix}$$

In  the  REGRX  algorithm,  as  in  most  regression  algorithms,  the  (i,j) 
entry  of  R  is  the  Pearson  product  moment  correlation  coefficient 
between  variables  i  and  j  based  on  the  sample  data  for  the  regression 
problem.  The  superscript  T  denotes  that  the  column  vector  g  has  been 
transposed  into  row  vector  form. 

²See footnote in Appendix A.

In the Gaussian elimination algorithm, $Q$ is the identity matrix.
The matrix $P$ is obtained as a product of the factors $P^{(1)}, P^{(2)}, \ldots$
At the first step, $P^{(1)}$ is calculated and $R$ is transformed to
$P^{(1)}R$. Renaming $P^{(1)} = M^{(1)}$ and $P^{(1)}R = R^{(1)}$, the succeeding
steps proceed by calculating $P^{(i)}$, $M^{(i)} = P^{(i)}M^{(i-1)}$, and
$R^{(i)} = P^{(i)}R^{(i-1)}$.

Let $i'$ denote the variable to be added or deleted from the
equation for step $i-1$. If variable $i'$ is added, then $P^{(i)}$ is
computed so that the $i'$ column of $R^{(i)}$ is the $i'$ column of the
identity matrix. If variable $i'$ is deleted, then $P^{(i)}$ is computed
so that the $i'$ column of $M^{(i)}$ is the $i'$ column of the identity
matrix. The matrix $P^{(i)}$ is equal to the identity matrix except in
column $i'$. The $i'$ column of $P^{(i)}$ is chosen in the following
manner. If variable $i'$ is being added and the $i'$ column of $R^{(i-1)}$
is denoted by $(a_1, a_2, \ldots, a_{i'}, \ldots, a_v)^T$, then the $i'$ column
of $P^{(i)}$ is

$$-\frac{1}{a_{i'}}(a_1, \ldots, a_{i'-1}, -1, a_{i'+1}, \ldots, a_v)^T.$$

If variable $i'$ is being deleted and the $i'$ column of $M^{(i-1)}$ is
denoted by $(a_1, a_2, \ldots, a_{i'}, \ldots, a_v)^T$, then the $i'$ column of
$P^{(i)}$ is

$$-\frac{1}{a_{i'}}(a_1, \ldots, a_{i'-1}, -1, a_{i'+1}, \ldots, a_v)^T.$$

Recalling that $L$ denotes the set of variables in the regression
equation for step $i$ and $E$ contains all independent variables that are
not in $L$, it is easy to see that if $j \in L$ ($j$ is an element of $L$),
then column $j$ of $R^{(i)}$ is equal to column $j$ of the identity matrix;
and if $k \in E$, then column $k$ of $M^{(i)}$ is equal to column $k$ of the
identity matrix.
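
The add and delete steps described above can be sketched in a few lines. This is an illustrative pure-Python sketch rather than the REGRX code; the function names and the 3 x 3 correlation matrix `R0` are hypothetical. It builds the elimination matrix from the pivot column, applies it to both R and M, and then reverses the step by deleting the same variable.

```python
def identity(m):
    return [[1.0 if r == c else 0.0 for c in range(m)] for r in range(m)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def column(A, k):
    return [row[k] for row in A]

def elimination_matrix(a, k):
    """Identity except column k, which is -(1/a[k]) * (a[0], ..., -1, ..., a[-1])."""
    P = identity(len(a))
    for r in range(len(a)):
        P[r][k] = 1.0 / a[k] if r == k else -a[r] / a[k]
    return P

def add_variable(R, M, k):
    # The pivot column is taken from R when a variable enters.
    P = elimination_matrix(column(R, k), k)
    return matmul(P, R), matmul(P, M)

def delete_variable(R, M, k):
    # The pivot column is taken from M when a variable leaves.
    P = elimination_matrix(column(M, k), k)
    return matmul(P, R), matmul(P, M)

# Hypothetical correlation matrix for three variables.
R0 = [[1.0, 0.5, 0.6],
      [0.5, 1.0, 0.7],
      [0.6, 0.7, 1.0]]
M0 = identity(3)

R1, M1 = add_variable(R0, M0, 0)      # column 0 of R1 is now a unit column
R2, M2 = delete_variable(R1, M1, 0)   # restores R0 and the identity M
```

After the addition, column 0 of `R1` equals column 0 of the identity matrix; after the deletion, column 0 of `M2` does, and in this small case `R2` and `M2` return to their starting values.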

Let $p$ denote the number of elements in $L$. Symmetrically reorder
the rows and columns of $R^{(i)}$ so that the first $p$ rows and columns
of the reordered matrix will coincide with the rows and columns of
$R^{(i)}$ corresponding to the elements of $L$. Mathematically this is
accomplished by postmultiplying $R^{(i)}$ and $M^{(i)}$ by a permutation
matrix which is denoted by $Q_L$ and premultiplying $R^{(i)}Q_L$ and
$M^{(i)}Q_L$ by $Q_L^T$. Thus the matrices $Q_L^T R^{(i)} Q_L$ and
$Q_L^T M^{(i)} Q_L$ will have the special forms

(1-1)
$$Q_L^T R^{(i)} Q_L = \begin{bmatrix} I_p & D_{12} \\ 0 & D_{22} \end{bmatrix}$$

where $I_p$ is the p x p identity matrix
$0$ is an (m-p) x p matrix of 0s
$D_{12}$ is p x (m-p)
$D_{22}$ is (m-p) x (m-p)

(1-2)
$$Q_L^T M^{(i)} Q_L = \begin{bmatrix} S & 0 \\ U & I_{m-p} \end{bmatrix}$$

where $S$ is p x p
$U$ is (m-p) x p
$0$ is a p x (m-p) matrix of 0s
$I_{m-p}$ is the (m-p) x (m-p) identity matrix

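If this same reordering of rows and columns is performed on $R$, then
from $R^{(i)} = M^{(i)}R$ the following matrix identity must hold.

$$Q_L^T R^{(i)} Q_L = Q_L^T M^{(i)} R\,Q_L = [Q_L^T M^{(i)} Q_L][Q_L^T R\,Q_L]$$

or

(1-3)
$$\begin{bmatrix} I_p & D_{12} \\ 0 & D_{22} \end{bmatrix} = \begin{bmatrix} S & 0 \\ U & I_{m-p} \end{bmatrix} \begin{bmatrix} R_{11} & R_{12} \\ R_{12}^T & R_{22} \end{bmatrix}$$

where
$$Q_L^T R\,Q_L = \begin{bmatrix} R_{11} & R_{12} \\ R_{12}^T & R_{22} \end{bmatrix}.$$

Any $c \times d$ matrix $B$ is said to be partitioned conformable to the
matrix $A$ if

$$B = \begin{bmatrix} B_{11} & \cdots & B_{1q} \\ \vdots & & \vdots \\ B_{b1} & \cdots & B_{bq} \end{bmatrix}$$

where $B_{kj}$ is $c_k \times d_j$, $\sum_{k=1}^{b} c_k = c$, and $\sum_{j=1}^{q} d_j = d$.

The product $C = AB$ may be written in partitioned form as

$$C = \begin{bmatrix} C_{11} & \cdots & C_{1q} \\ \vdots & & \vdots \\ C_{s1} & \cdots & C_{sq} \end{bmatrix}$$

where $C_{ij} = \sum_{k=1}^{b} A_{ik}B_{kj}$.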

Performing the matrix multiplication in (1-3) by partitions gives

(1-4)
(i) $S = R_{11}^{-1}$
(ii) $D_{12} = R_{11}^{-1} R_{12}$
(iii) $U = -R_{12}^T R_{11}^{-1}$
(iv) $D_{22} = R_{22} - R_{12}^T R_{11}^{-1} R_{12}$

Note that $R_{11}$ is the correlation matrix for the variables in the set
$L$. Let $g$ denote the column of $R_{12}$ corresponding to variable $v$,
where either $v \in E$ or $v$ is the variable number of the response. Let $b$
denote the corresponding column of $D_{12}$. The elements of $g$ are the
correlations of variable $v$ with the variables in the set $L$ and (1-4)(ii)
implies that $b$ satisfies the equation $R_{11}b = g$. Therefore, the
elements of $b$ are the standardized regression weights for the
regression of variable $v$ on the variables in set $L$. (1-4)(iv) implies
that the diagonal entry $D_{vv}$ of $D_{22}$ directly below the $b$ column
of $D_{12}$ is equal to $1 - g^T R_{11}^{-1} g = 1 - g^T b = 1 - R_L^2(v)$, where $R_L^2(v)$
is the squared multiple correlation coefficient for the regression of
variable $v$ on the set of variables $L$. The off-diagonal elements of
$D_{22}$ may be converted to partial correlation coefficients after
dividing by the square root of the diagonal elements in the same row
and column. Thus the (j,k) element of $D_{22}$ divided by the square
roots of the (j,j) and (k,k) elements is the partial correlation
coefficient between variables $v_j$ and $v_k$ after removing the linear
influence of the variables in the set $L$, where $v_j$ and $v_k$ refer to
the variables occupying the j and k columns (and rows) of $D_{22}$.
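
These interpretations can be checked numerically. The sketch below is an illustration only, with a hypothetical 4 x 4 correlation matrix in which variables 0 and 1 form the set L, variable 2 is a candidate predictor, and variable 3 plays the role of the response. The helper `sweep_in` is an assumed name; it applies the elimination step for adding a variable directly to the rows of the matrix.

```python
import math

def sweep_in(T, k):
    """Elimination step for adding variable k: column k becomes a unit column."""
    piv = T[k][k]
    out = []
    for r, row in enumerate(T):
        if r == k:
            out.append([x / piv for x in row])
        else:
            f = row[k] / piv
            out.append([x - f * y for x, y in zip(row, T[k])])
    return out

# Hypothetical correlation matrix; variable 3 is treated as the response.
R = [[1.0, 0.3, 0.4, 0.5],
     [0.3, 1.0, 0.2, 0.6],
     [0.4, 0.2, 1.0, 0.3],
     [0.5, 0.6, 0.3, 1.0]]

T = sweep_in(sweep_in(R, 0), 1)      # L = {0, 1}

b = [T[0][3], T[1][3]]               # column of D12 for variable 3
g = [R[0][3], R[1][3]]               # correlations of variable 3 with L
R11_b = [R[0][0] * b[0] + R[0][1] * b[1],
         R[1][0] * b[0] + R[1][1] * b[1]]

D_vv = T[3][3]                       # should equal 1 - R^2 for variable 3 on L
r_squared = g[0] * b[0] + g[1] * b[1]

# Partial correlation of variables 2 and 3 with L held constant.
partial_23 = T[2][3] / math.sqrt(T[2][2] * T[3][3])
```

Here `R11_b` reproduces `g`, `D_vv` equals `1 - r_squared`, and `partial_23` agrees with the value obtained from the usual recursive formula for partial correlations.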

Further characterizations of the elements of $S$, $D_{12}$, and $D_{22}$
are obtained through a careful study of an individual step in the
elimination procedure. Figure B1 is a representation of the
operations performed during step i+1 showing the transitions $R^{(i)}$
to $R^{(i+1)}$ and $M^{(i)}$ to $M^{(i+1)}$ for the case where variable j is
deleted from the regression equation. In Figure B1, $L$ denotes the set
of p variables in the equation at step i, so $j \in L$. $Q_L$ denotes a
permutation matrix that reorders variables so that all variables in $L$
appear first; moreover, within $L$, variable j appears last (i.e.,
pth). Figure B2 shows the same transitions for the case where
variable j is added. In Figure B2, $L$ denotes the set of variables in
the equation at step i+1, so $j \in L$. $Q_L$ has the same function as
described for Figure B1. It should be mentioned that the variable j
referred to in Figure B1 is not the same variable j referred to in
Figure B2. Also, the partition components of the matrices appearing
in Figure B1 are not the same as the corresponding partition
components in Figure B2 although the names used in both figures are
the same.

[Figure B1. Representation of Matrices Used During Elimination Step i+1 for Deletion of Variable j (partitioned matrices; row p and column p correspond to variable j).]

[Figure B2. Representation of Matrices Used During Elimination Step i+1 for Addition of Variable j (partitioned matrices; row p and column p correspond to variable j).]


Recall that to delete variable j at step i+1 the matrix $P^{(i+1)}$
is chosen so that column j of $M^{(i+1)} = P^{(i+1)}M^{(i)}$ will be equal
to column j of the identity matrix. In Figure B1, the rows and
columns of the matrices $P^{(i+1)}$, $R^{(i)}$, $R^{(i+1)}$, and $M^{(i+1)}$
have been reordered by means of the permutation matrix $Q_L$ for the
purpose of simplifying their partitioned form. In the reordered
matrices, the elements corresponding to variables in the set $L$ occupy
the first p rows and columns and the entries for variable j occupy the
pth row and column. Let $v$ denote any variable not in $L$. Thus,
either $v \in E$ or $v$ is the variable number of the response. The component
of $d$ corresponding to variable $v$ is denoted by $d_v$. Similarly the
diagonal entry of $D$ corresponding to variable $v$ is denoted by $D_{vv}$.

Comparing $Q_L^T R^{(i)} Q_L$ with (1-1), and recalling (1-4)(ii)
and (iv), it follows that $d_v$ is the standardized regression
coefficient $B_{Lj}(v)$ for variable j in the regression of variable $v$ on
the set of variables $L$ and that $D_{vv} = 1 - R_L^2(v)$. If $v$ is a variable
not in the set $L$, then $B_{Lj}(v)$ denotes the standardized regression
coefficient for variable j in the regression of variable $v$ on the set
of variables in $L$, and $\rho_{L-j}(j,v)$ denotes the partial correlation
between variables j and $v$ after partialling out the linear influence
of variables in $L-j$. A similar comparison for $Q_L^T R^{(i+1)} Q_L$ shows
that

$$\frac{1}{s} = 1 - R_{L-j}^2(j), \qquad D_{vv} + \frac{d_v^2}{s} = 1 - R_{L-j}^2(v), \qquad \text{and} \qquad \frac{d_v}{s} = \rho_{L-j}(j,v)\sqrt{[1 - R_{L-j}^2(j)][1 - R_{L-j}^2(v)]}.$$

From these relationships, the following results can be derived.

(1) Characterization of the standardized regression coefficient
in terms of a partial correlation coefficient.

$$B_{Lj}(v) = \rho_{L-j}(j,v)\sqrt{\frac{1 - R_{L-j}^2(v)}{1 - R_{L-j}^2(j)}}$$

(2) Characterization of the standardized regression coefficient
in terms of the increase in the squared multiple correlation
coefficient due to the addition of variable j (independent contribution).

$$B_{Lj}^2(v) = \frac{R_L^2(v) - R_{L-j}^2(v)}{1 - R_{L-j}^2(j)}$$

(3) Characterization of the independent contribution of a variable
in terms of the partial correlation coefficient.

$$R_L^2(v) - R_{L-j}^2(v) = \rho_{L-j}^2(j,v)\,[1 - R_{L-j}^2(v)] = \frac{d_v^2}{s}$$

(4) Characterization of the partial F-statistic for the hypothesis
$\beta_{Lj}(v) = 0$.

$$F_j = \frac{R_L^2(v) - R_{L-j}^2(v)}{1 - R_L^2(v)}\,(n-p-1) = \frac{\rho_{L-j}^2(j,v)}{1 - \rho_{L-j}^2(j,v)}\,(n-p-1) = \frac{B_{Lj}^2(v)\,[1 - R_{L-j}^2(j)]}{1 - R_L^2(v)}\,(n-p-1) = \frac{d_v^2}{sD_{vv}}\,(n-p-1)$$

Note that this last formula also allows a characterization of the
standard error of the standardized regression coefficient. It is
known that $F_j = t_j^2 = B_{Lj}^2(v)/s_{Lj}^2(v)$, where $t_j$ is the t-statistic for
the hypothesis $\beta_{Lj}(v) = 0$ and $s_{Lj}(v)$ is the standard error of $B_{Lj}(v)$.

(5) Characterization of the standard error of the standardized
regression coefficient.

$$s_{Lj}(v) = \sqrt{\frac{1 - R_L^2(v)}{1 - R_{L-j}^2(j)} \cdot \frac{1}{n-p-1}} = \sqrt{\frac{sD_{vv}}{n-p-1}}$$
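
Results (3) through (5) can be checked on a small case. The sketch below is illustrative only and is not taken from REGRX; the correlation matrix, the sample size n = 50, and the helper name `sweep` are assumptions. It carries R and M side by side in one tableau so that d_v, s, and D_vv can be read off after both predictors have entered.

```python
import math

def sweep(T, k):
    """Elimination step on the tableau T = [R | M], pivoting on entry (k, k) of R."""
    piv = T[k][k]
    out = []
    for r, row in enumerate(T):
        if r == k:
            out.append([x / piv for x in row])
        else:
            f = row[k] / piv
            out.append([x - f * y for x, y in zip(row, T[k])])
    return out

# Hypothetical correlations: variables 0 and 1 are predictors, 2 is the response.
R = [[1.0, 0.5, 0.6],
     [0.5, 1.0, 0.7],
     [0.6, 0.7, 1.0]]
m = len(R)
tableau = [R[r] + [1.0 if r == c else 0.0 for c in range(m)] for r in range(m)]
tableau = sweep(sweep(tableau, 0), 1)      # L = {0, 1}

n, p = 50, 2        # assumed sample size and number of variables in L
j, v = 1, 2         # test variable j = 1 in the regression of v = 2 on L

d_v = tableau[j][v]           # standardized weight B_Lj(v)
D_vv = tableau[v][v]          # 1 - R_L^2(v)
s = tableau[j][m + j]         # diagonal entry of the inverse of R11 for variable j

F_j = d_v ** 2 / (s * D_vv) * (n - p - 1)      # result (4)
se = math.sqrt(s * D_vv / (n - p - 1))         # result (5)

# The same F from the increase in squared multiple correlation, result (3).
R2_full = 1.0 - D_vv
R2_reduced = R[0][v] ** 2                      # only variable 0 in the equation
F_check = (R2_full - R2_reduced) / (1.0 - R2_full) * (n - p - 1)
```

The two computations of the partial F agree, and the squared ratio of `d_v` to `se` reproduces the same value, as the relation between F and t requires.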
In two final results, a characterization for the elements of the
inverse and the determinant of the correlation matrix is obtained.
Comparing $Q_L^T M^{(i)} Q_L$ with (1-2) (using (1-4)(i)) and denoting the
correlation matrix for the variables in the set $L$ by $R_{11}$, it follows
that the diagonal entry, $s$, of $R_{11}^{-1}$ corresponding to variable j is
equal to the reciprocal of $1 - R_{L-j}^2(j)$. Comparing $Q_L^T R^{(i+1)} Q_L$
with (1-1) and recalling (1-4)(ii), it also follows that $-\frac{1}{s}b$ is the
vector of standardized regression weights for the regression of
variable j on the remaining variables in the set $L$. Therefore, the
non-diagonal entries of the column of $R_{11}^{-1}$ corresponding to variable
j have a simple relation to the standardized regression weights for
the regression of variable j on the remaining variables in the set $L$.

To obtain the expression for the determinant, suppose that
$P^{(i+1)}, P^{(i+2)}, \ldots, P^{(i+p)}$ were chosen to successively delete
variables from $L$ until it was empty; then the expression
$P^{(i+p)} \cdots P^{(i+1)} M^{(i)} = I$ would hold. This fact implies
that $\det(P^{(i+p)} \cdots P^{(i+1)}) = 1/\det(M^{(i)})$. It is also known
that $\det(M^{(i)}) = \det(R_{11}^{-1}) = 1/\det(R_{11})$ and
$\det(P^{(i+1)}) = \frac{1}{s} = 1 - R_{L-j}^2(j)$. A generalization of this
relationship gives

$$\det(P^{(i+2)}) = 1 - R_{L-j-j_2}^2(j_2), \; \ldots, \; \det(P^{(i+p-1)}) = 1 - R_{L-j-j_2-\cdots-j_{p-1}}^2(j_{p-1}), \; \det(P^{(i+p)}) = 1,$$

where $j, j_2, j_3, \ldots, j_p$ is the order of deletion of variables.
Simplifying this notation gives the following general result:

$$\det(R_{11}) = (1 - R_{2.1}^2)(1 - R_{3.21}^2)(1 - R_{4.321}^2) \cdots (1 - R_{p.p-1 \ldots 1}^2)$$

where, for example, $R_{4.321}^2$ represents the squared multiple
correlation coefficient from the regression of variable 4 on variables
1 through 3.
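
The determinant result can be illustrated with a short sketch. The text derives it by successive deletions; the code below instead enters the variables one at a time, which yields the same factors because each pivot met on entry equals one minus the squared multiple correlation of that variable with those already entered. The function name and the matrix are hypothetical.

```python
def det_by_pivots(R):
    """Return (determinant, pivots), where pivot k is 1 - R^2 of variable k
    regressed on variables 0, ..., k-1 (and the first pivot is 1)."""
    T = [row[:] for row in R]
    det, pivots = 1.0, []
    for k in range(len(T)):
        piv = T[k][k]
        pivots.append(piv)
        det *= piv
        # Elimination step for entering variable k; T[k] below is the old row k.
        T = [[x / piv for x in row] if r == k
             else [x - (row[k] / piv) * y for x, y in zip(row, T[k])]
             for r, row in enumerate(T)]
    return det, pivots

# Hypothetical 3 x 3 correlation matrix.
R = [[1.0, 0.5, 0.6],
     [0.5, 1.0, 0.7],
     [0.6, 0.7, 1.0]]
det, pivots = det_by_pivots(R)
```

For this matrix the second pivot is 1 - 0.5² = 0.75, and the product of the pivots matches the determinant obtained by cofactor expansion, 0.32.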
The formula for the partial F for entry remains to be derived.
This is most easily accomplished by means of Figure B2. As before,
let $d_v$ denote the component of $d$ corresponding to variable $v$ and let
$D_{vv}$ denote the diagonal entry of $D$ corresponding to variable $v$.
Comparing $Q_L^T R^{(i)} Q_L$ and $Q_L^T R^{(i+1)} Q_L$ with (1-1) and recalling
(1-4)(iv), it follows that $1 - R_{L-j}^2(v) = D_{vv}$, $s = 1 - R_{L-j}^2(j)$,
and $1 - R_L^2(v) = D_{vv} - \frac{d_v^2}{s}$. Therefore, the increase in the squared
multiple correlation due to the addition of variable j is

$$R_L^2(v) - R_{L-j}^2(v) = \frac{d_v^2}{s}.$$

This gives the following computational
formula for the partial F-statistic for the entry of variable j,


R?(v)  -  R?  ,(v) 


